Abstract. Most traditional artificial intelligence (AI) systems of the
past 50 years are either very limited, or based on heuristics, or both. The 
new millennium, however, has brought substantial progress in the field of 
theoretically optimal and practically feasible algorithms for prediction, 
search, inductive inference based on Occam's razor, problem solving, 
decision making, and reinforcement learning in environments of a very 
general type. Since inductive inference is at the heart of all inductive 
sciences, some of the results are relevant not only for AI and computer 
science but also for physics, provoking nontraditional predictions based 
on Zuse's thesis of the computer-generated universe. 

1 Introduction 

Remarkably, there is a theoretically optimal way of making predictions based on 
observations, rooted in the early work of Solomonoff and Kolmogorov [62, 28].
The approach reflects basic principles of Occam's razor: simple explanations of 
data are preferable to complex ones. 

The theory of universal inductive inference quantifies what simplicity really 
means. Given certain very broad computability assumptions, it provides tech- 
niques for making optimally reliable statements about future events, given the 
past. 

Once there is an optimal, formally describable way of predicting the future, 
we should be able to construct a machine that continually computes and executes 
action sequences that maximize expected or predicted reward, thus solving an 
ancient goal of AI research. 

For many decades, however, AI researchers have not paid a lot of attention to 
the theory of inductive inference. Why not? There is another reason besides the 
fact that most of them have traditionally ignored theoretical computer science: 
the theory has been perceived as being associated with excessive computational 
costs. In fact, its most general statements refer to methods that are optimal 
(in a certain asymptotic sense) but incomputable. So researchers in machine
learning and artificial intelligence have often resorted to alternative methods
that lack a strong theoretical foundation but at least seem feasible in certain
limited contexts. For example, since the early attempts at building a "General
Problem Solver" [36, 43] much work has been done to develop mostly heuristic
machine learning algorithms that solve new problems based on experience with 
previous problems. Many pointers to learning by chunking, learning by macros, 
hierarchical learning, learning by analogy, etc. can be found in Mitchell's book 
[51] and Kaelbling's survey [27].

Recent years, however, have brought substantial progress in the field of com- 
putable and feasible variants of optimal algorithms for prediction, search, induc- 
tive inference, problem solving, decision making, and reinforcement learning in 
very general environments. In what follows I will focus on the results obtained 
at IDSIA. 

Sections 3, 4, and 7 relate Occam's razor and the notion of simplicity to the
shortest algorithms for computing computable objects, and will concentrate on
recent asymptotic optimality results for universal learning machines, essentially
ignoring issues of practical feasibility; compare Hutter's contribution [21] in this
volume.

Section 5, however, will focus on our recent non-traditional simplicity measure
which is not based on the shortest but on the fastest way of describing
objects, and Section 6 will use this measure to derive non-traditional predictions
concerning the future of our universe.

Sections 8, 9, and 10 will finally address quite pragmatic issues and "true"
time-optimality: given a problem and only so much limited computation time, what
is the best way of spending it on evaluating solution candidates? In particular,
Section 9 will outline a bias-optimal way of incrementally solving each task in
a sequence of tasks with quickly verifiable solutions, given a probability
distribution (the bias) on programs computing solution candidates. Bias shifts are
computed by program prefixes that modify the distribution on their suffixes by
reusing successful code for previous tasks (stored in non-modifiable memory).
No tested program gets more runtime than its probability times the total search
time. In illustrative experiments, ours becomes the first general system to learn
a universal solver for arbitrary $n$-disk Towers of Hanoi tasks (minimal solution
size $2^n - 1$). It demonstrates the advantages of incremental learning by profiting
from previously solved, simpler tasks involving samples of a simple context-free
language. Section 10 discusses how to use this approach for building general
reinforcement learners.

Finally, Section 11 will summarize the recent Gödel machine [56], a
self-referential, theoretically optimal self-improver which explicitly addresses the
'Grand Problem of Artificial Intelligence' [58] by optimally dealing with limited
resources in general reinforcement learning settings.



2 More Formally 



What is the optimal way of predicting the future, given the past? Which is the 
best way to act such as to maximize one's future expected reward? Which is the 
best way of searching for the solution to a novel problem, making optimal use 
of solutions to earlier problems? 

Most previous work on these old and fundamental questions has focused on 
very limited settings, such as Markovian environments where the optimal next 
action, given past inputs, depends on the current input only [27].

We will concentrate on a much weaker and therefore much more general
assumption, namely, that the environment's responses are sampled from a
computable probability distribution. If even this weak assumption were not true
then we could not even formally specify the environment, let alone write
reasonable scientific papers about it.

Let us first introduce some notation. $B^*$ denotes the set of finite sequences
over the binary alphabet $B = \{0, 1\}$, $B^\infty$ the set of infinite sequences over $B$,
$\lambda$ the empty string, $B^\sharp = B^* \cup B^\infty$. $x, y, z, z^1, z^2$ stand for strings in $B^\sharp$. If
$x \in B^*$ then $xy$ is the concatenation of $x$ and $y$ (e.g., if $x = 10000$ and $y = 1111$
then $xy = 100001111$). For $x \in B^*$, $l(x)$ denotes the number of bits in $x$, where
$l(x) = \infty$ for $x \in B^\infty$; $l(\lambda) = 0$. $x_n$ is the prefix of $x$ consisting of the first $n$
bits, if $l(x) \geq n$, and $x$ otherwise ($x_0 := \lambda$). $\log$ denotes the logarithm with basis
2; $f, g$ denote functions mapping integers to integers. We write $f(n) = O(g(n))$
if there exist positive constants $c, n_0$ such that $f(n) \leq c\,g(n)$ for all $n > n_0$.
For simplicity let us consider universal Turing Machines [67] (TMs) with input
alphabet $B$ and trinary output alphabet including the symbols "0", "1", and " "
(blank). For efficiency reasons, the TMs should have several work tapes to avoid
potential quadratic slowdowns associated with 1-tape TMs. The remainder of
this paper assumes a fixed universal reference TM.

Now suppose bitstring $x$ represents the data observed so far. What is its most
likely continuation $y \in B^\sharp$? Bayes' theorem yields

$$P(xy \mid x) = \frac{P(x \mid xy)\,P(xy)}{P(x)} \propto P(xy) \qquad (1)$$

where $P(z^2 \mid z^1)$ is the probability of $z^2$, given knowledge of $z^1$, and
$P(x) = \int_{z \in B^\sharp} P(xz)\,dz$ is just a normalizing factor. So the most likely continuation $y$
is determined by $P(xy)$, the prior probability of $xy$. But which prior measure
$P$ is plausible? Occam's razor suggests that the "simplest" $y$ should be more
probable. But which exactly is the "correct" definition of simplicity? Sections 3
and 4 will measure the simplicity of a description by its length. Section 5 will
measure the simplicity of a description by the time required to compute the
described object.
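As a minimal sketch of Equation (1), the following Python fragment picks the most likely continuation of an observed prefix under a finite prior. The prior `toy_prior` and the function name `most_likely_continuation` are hypothetical and chosen only for illustration; the priors discussed in this paper range over all strings and are not finite tables.

```python
from fractions import Fraction

def most_likely_continuation(prior, x):
    """Return (y, P(xy | x)) for the continuation y that maximizes P(xy)."""
    # Keep only the hypotheses that are consistent with the observed prefix x.
    consistent = {s: p for s, p in prior.items() if s.startswith(x)}
    # The sum over consistent strings plays the role of the normalizer P(x).
    norm = sum(consistent.values())
    best = max(consistent, key=consistent.get)
    return best[len(x):], consistent[best] / norm

# Hypothetical toy prior over complete 4-bit strings (an assumption, sums to 1).
toy_prior = {
    "0000": Fraction(1, 2),
    "0101": Fraction(1, 4),
    "0110": Fraction(1, 8),
    "1111": Fraction(1, 8),
}

continuation, posterior = most_likely_continuation(toy_prior, "01")
```

Because the normalizer is the same for every candidate, ranking continuations by the posterior is the same as ranking them by the prior of the full string, which is the point of Equation (1).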



3 Prediction Using a Universal Algorithmic Prior Based 
on the Shortest Way of Describing Objects 

Roughly forty years ago Solomonoff started the theory of universal optimal
induction based on the apparently harmless simplicity assumption that $P$ is
computable |^. While Equation (1) makes predictions of the entire future, given
the past, Solomonoff [SSj focuses just on the next bit in a sequence. Although
this provokes surprisingly nontrivial problems associated with translating the
bitwise approach to alphabets other than the binary one (this was achieved
only recently [20]), it is sufficient for obtaining essential insights. Given an
observed bitstring $x$, Solomonoff assumes the data are drawn according to a
recursive measure $\mu$; that is, there is a program for a universal Turing machine
that reads $x \in B^*$ and computes $\mu(x)$ and halts. He estimates the probability
of the next bit (assuming there will be one), using the remarkable, well-studied,
enumerable prior $M$ |K2l77l63ll51ST]

$$M(x) = \sum_{\text{program prefix } p \text{ computes output starting with } x} 2^{-l(p)} \qquad (2)$$

$M$ is universal, dominating the less general recursive measures as follows: For
all $x \in B^*$,

$$M(x) \geq c_\mu\,\mu(x) \qquad (3)$$

where $c_\mu$ is a constant depending on $\mu$ but not on $x$. Solomonoff observed that
the conditional $M$-probability of a particular continuation, given previous
observations, converges towards the unknown conditional $\mu$ as the observation size
goes to infinity and that the sum over all observation sizes of the
corresponding $\mu$-expected deviations is actually bounded by a constant. Hutter (on
the author's SNF research grant "Unification of Universal Induction and
Sequential Decision Theory") recently showed that the number of prediction errors
made by universal Solomonoff prediction is essentially bounded by the number
of errors made by any other predictor, including the optimal scheme based on
the true $\mu$ EDI-
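Spelled out, the next-bit estimate and one consequence of the dominance property (3) read as follows. This is a standard reformulation rather than an additional result, and the notation $M(1 \mid x)$ for the conditional is introduced here only for illustration.

```latex
\[
  M(1 \mid x) = \frac{M(x1)}{M(x)},
  \qquad
  M(x) \ge c_\mu\,\mu(x)
  \;\Longrightarrow\;
  -\log M(x) \le -\log \mu(x) + \log\frac{1}{c_\mu}.
\]
```

Read as code lengths, the second relation says that describing the data with $M$ costs at most a constant number of bits more than describing it with the true $\mu$, where the constant depends on $\mu$ but not on $x$.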

Recent Loss Bounds for Universal Prediction. A more general recent
result is this. Assume we do know that $p$ is in some set $P$ of distributions. Choose
a fixed weight $w_q$ for each $q$ in $P$ such that the $w_q$ add up to 1 (for simplicity, let
$P$ be countable). Then construct the Bayesmix $M(x) = \sum_q w_q\,q(x)$, and predict
using $M$ instead of the optimal but unknown $p$. How wrong is it to do that? The
recent work of Hutter provides general and sharp (!) loss bounds ^T] :

Let $L_M(n)$ and $L_p(n)$ be the total expected unit losses of the $M$-predictor
and the $p$-predictor, respectively, for the first $n$ events. Then $L_M(n) - L_p(n)$
is at most of the order of $\sqrt{L_p(n)}$. That is, $M$ is not much worse than $p$. And
in general, no other predictor can do better than that! In particular, if $p$ is
deterministic, then the $M$-predictor soon won't make any errors any more.
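The Bayesmix construction can be sketched for a small finite class of models. The sketch below assumes each model is a Bernoulli source with a fixed probability `theta` of emitting a 1; this model class, the uniform initial weights, and the function names are assumptions made for illustration and are far narrower than the countable classes the bounds cover.

```python
def bayesmix_predict(weights, thetas):
    """Mixture probability that the next bit is 1."""
    total = sum(weights)
    return sum(w * th for w, th in zip(weights, thetas)) / total

def bayesmix_update(weights, thetas, bit):
    """Multiply each weight by the likelihood its model assigned to `bit`."""
    return [w * (th if bit == 1 else 1.0 - th) for w, th in zip(weights, thetas)]

def run(sequence, thetas):
    """Predict each bit with the mixture, count errors, return final weights."""
    weights = [1.0 / len(thetas)] * len(thetas)  # uniform prior weights w_q
    errors = 0
    for bit in sequence:
        guess = 1 if bayesmix_predict(weights, thetas) >= 0.5 else 0
        errors += int(guess != bit)
        weights = bayesmix_update(weights, thetas, bit)
    return errors, weights

# Hypothetical data: a source that always emits 1, three candidate models.
errors, weights = run([1] * 20, [0.2, 0.5, 0.9])
```

The weights are left unnormalized, which is fine for short sequences; a longer run would need renormalization or log-weights to avoid floating-point underflow.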

If P contains all recursively computable distributions, then M becomes the
celebrated enumerable universal prior. That is, after decades of somewhat
stagnating research we now have sharp loss bounds for Solomonoff's universal
induction scheme (compare recent work of Merhav and Feder [33]).

Solomonoff's approach, however, is uncomputable. To obtain a feasible ap- 
proach, reduce M to what you get if you, say, just add up weighted estimated 
future finance data probabilities generated by 1000 commercial stock-market 
prediction software packages. If only one of the probability distributions hap- 
pens to be close to the true one (but you do not know which) you still should 
get rich. 

Note that the approach is much more general than what is normally done in 
traditional statistical learning theory, e.g., [69], where the often quite unrealistic
assumption is that the observations are statistically independent. 

4 Super Omegas and Generalizations of Kolmogorov 
Complexity & Algorithmic Probability 

Our recent research generalized Solomonoff's approach to the case of less
restrictive nonenumerable universal priors that are still computable in the limit.

An object X is formally describable if a finite amount of information com- 
pletely describes X and only X. More to the point, X should be representable 
by a possibly infinite bitstring x such that there is a finite, possibly never halting 
program p that computes x and nothing but x in a way that modifies each out- 
put bit at most finitely many times; that is, each finite beginning of x eventually 
converges and ceases to change. This constructive notion of formal describabil- 
ity is less restrictive than the traditional notion of computability [HZIj mainly 
because we do not insist on the existence of a halting program that computes 
an upper bound of the convergence time of p's n-th output bit. Formal de- 
scribability thus pushes constructivism |5I1| to the extreme, barely avoiding the 
nonconstructivism embodied by even less restrictive concepts of describability 
(compare computability in the limit [17, 40, 14] and $\Delta^0_n$-describability [42][31, p.
46-47]).
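A toy illustration of output that converges in the limit: every bit is revised at most finitely often, so each finite beginning eventually stops changing. In the sketch below, `halting_times` is a hypothetical stand-in for a family of programs, with `None` meaning "never halts". Because the toy supplies these times explicitly, convergence here is trivially bounded; for real programs no halting procedure yields such a bound, which is exactly what separates this notion from traditional computability.

```python
def limit_approximations(halting_times, rounds):
    """Yield successive approximations of the string whose n-th bit is 1
    iff toy program n halts. Each bit flips at most once (0 -> 1)."""
    bits = [0] * len(halting_times)
    for t in range(1, rounds + 1):
        for n, h in enumerate(halting_times):
            # After t simulated steps we have seen program n halt iff h <= t.
            if h is not None and h <= t:
                bits[n] = 1
        yield "".join(map(str, bits))

# Hypothetical programs: the first halts after 3 steps, the second never,
# the third after 1 step.
approximations = list(limit_approximations([3, None, 1], 4))
```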

The traditional theory of inductive inference focuses on Turing machines 
with one-way write-only output tape. This leads to the universal enumerable 
Solomonoff-Levin (semi) measure. We introduced more general, nonenumerable, 
but still limit-computable measures and a natural hierarchy of generalizations 
of algorithmic probability and Kolmogorov complexity [50, 52], suggesting that
the "true" information content of some (possibly infinite) bitstring x actually
is the size of the shortest nonhalting program that converges to x and nothing
but x on a Turing machine that can edit its previous outputs. In fact, this
"true" content is often smaller than the traditional Kolmogorov complexity. We 
showed that there are Super Omegas computable in the limit yet more random 
than Chaitin's "number of wisdom" Omega PI (which is maximally random in a 
weaker traditional sense) , and that any approximable measure of x is small for 
any x lacking a short description. 



We also showed that there is a universal cumulatively enumerable measure of
x based on the measure of all enumerable y lexicographically greater than x. It is
more dominant yet just as limit-computable as Solomonoff's [52]. That is, if we
are interested in limit-computable universal measures, we should prefer the novel
universal cumulatively enumerable measure over the traditional enumerable one.
If we include in our Bayesmix such limit-computable distributions we obtain
again sharp loss bounds for prediction based on the mix [50, 52].

Our approach highlights differences between countable and uncountable sets.
Which are the potential consequences for physics? We argue that things such
as uncountable time and space and incomputable probabilities actually should
not play a role in explaining the world, for lack of evidence that they are really
necessary [50]. Some may feel tempted to counter this line of reasoning by
pointing out that for centuries physicists have calculated with continua of real
numbers, most of them incomputable. Even quantum physicists who are ready
to give up the assumption of a continuous universe usually do take for granted
the existence of continuous probability distributions on their discrete universes,
and Stephen Hawking explicitly said: "Although there have been suggestions that
space-time may have a discrete structure I see no reason to abandon the continuum
theories that have been so successful." Note, however, that all physicists in
fact have only manipulated discrete symbols, thus generating finite, describable
proofs of their results derived from enumerable axioms. That real numbers really
exist in a way transcending the finite symbol strings used by everybody may
be a figment of imagination; compare Brouwer's constructive mathematics
|5I1| and the Löwenheim-Skolem Theorem [32, 61] which implies that any first
order theory with an uncountable model such as the real numbers also has a
countable model. As Kronecker put it: "Die ganze Zahl schuf der liebe Gott,
alles Übrige ist Menschenwerk" ("God created the integers, all else is the work
of man"). Kronecker greeted with scepticism Cantor's celebrated insight [7]
about real numbers, mathematical objects Kronecker believed did not even exist.

Assuming our future lies among the few (countably many) describable fu- 
tures, we can ignore uncountably many nondescribable ones, in particular, the 
random ones. Adding the relatively mild assumption that the probability distri- 
bution from which our universe is drawn is cumulatively enumerable provides 
a theoretical justification of the prediction that the most likely continuations 
of our universes are computable through short enumeration procedures. In this 
sense Occam's razor is just a natural by-product of a computability assumption! 
But what about falsifiability? The pseudorandomness of our universe might be 
effectively undetectable in principle, because some approximable and enumerable 
patterns cannot be proven to be nonrandom in recursively bounded time. 

The next sections, however, will introduce additional plausible assumptions 
that do lead to computable optimal prediction procedures. 



5 Computable Predictions through the Speed Prior 
Based on the Fastest Way of Describing Objects 

Unfortunately, while M and the more general priors of Section 4 are computable
in the limit, they are not recursive, and thus practically infeasible. This drawback
inspired less general yet practically more feasible principles of minimum
description length (MDL) [71, 41] as well as priors derived from time-bounded
restrictions [2] of Kolmogorov complexity [28, 62, 9]. No particular instance of
these approaches, however, is universally accepted or has a general convincing 
motivation that carries beyond rather specialized application scenarios. For in- 
stance, typical efficient MDL approaches require the specification of a class of 
computable models of the data, say, certain types of neural networks, plus some 
computable loss function expressing the coding costs of the data relative to the 
model. This provokes numerous ad-hoc choices. 

Our recent work [54], however, offers an alternative to the celebrated but
noncomputable algorithmic simplicity measure or Solomonoff-Levin measure discussed
above [62, 77, 63]. We introduced a new measure (a prior on the computable
objects) which is not based on the shortest but on the fastest way of describing 
objects. 

Let us assume that the observed data sequence is generated by a computational
process, and that any possible sequence of observations is therefore
computable in the limit [50]. This assumption is stronger and more radical than
the traditional one: Solomonoff just insists that the probability of any sequence 
prefix is recursively computable, but the (infinite) sequence itself may still be 
generated probabilistically. 

Given our starting assumption that data are deterministically generated by 
a machine, it seems plausible that the machine suffers from a computational 
resource problem. Since some things are much harder to compute than others, 
the resource-oriented point of view suggests the following postulate. 

Postulate 1 The cumulative prior probability measure of all x incomputable 
within time t by any method is at most inversely proportional to t. 

This postulate leads to the Speed Prior $S(x)$, the probability that the output of
the following probabilistic algorithm starts with $x$ [54]:

Algorithm GUESS:
Initialize: Set t := 1. Let the input scanning head of a universal TM
point to the first cell of its initially empty input tape.
Forever repeat: While the number of instructions executed so far exceeds
t: toss an unbiased coin; if heads is up set t := 2t; otherwise exit.
If the input scanning head points to a cell that already contains a bit,
execute the corresponding instruction (of the growing self-delimiting
program, e.g., [30, 31]). Else toss the coin again, set the cell's bit to 1 if heads
is up (0 otherwise), and set t := t/2.

Algorithm GUESS is very similar to a probabilistic search algorithm used in
previous work on applied inductive inference [47, 49]. On several toy problems it
generalized extremely well in a way unmatchable by traditional neural network
learning algorithms.
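The control flow of GUESS can be sketched in a few lines of Python. The universal TM is replaced here by a deliberately trivial, hypothetical `CopyMachine` whose every program bit is an instruction that copies itself to the output; this substitution is an assumption made purely to keep the sketch runnable and says nothing about the reference machine used in the paper.

```python
import random

class CopyMachine:
    """Hypothetical stand-in for the universal TM: each program bit is an
    instruction that appends that bit to the output and advances the head."""
    def __init__(self):
        self.tape, self.head, self.output = [], 0, []

    def cell_has_bit(self):
        return self.head < len(self.tape)

    def execute(self):
        self.output.append(self.tape[self.head])
        self.head += 1

def guess(machine, rng):
    """Sample one output by running GUESS on `machine` with coin source `rng`."""
    t = 1.0
    executed = 0
    while True:
        # Pay for runtime: each doubling of the time limit costs a coin toss.
        while executed > t:
            if rng.random() < 0.5:
                t *= 2
            else:
                return machine.output  # exit
        if machine.cell_has_bit():
            machine.execute()
            executed += 1
        else:
            # Pay for program length: each newly guessed bit halves t.
            machine.tape.append(1 if rng.random() < 0.5 else 0)
            t /= 2

sample = guess(CopyMachine(), random.Random(0))
```

Long programs and long runtimes are both paid for with coin tosses, which is how the procedure favors outputs that are quick to compute.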

With $S$ comes a computable method AS for predicting optimally within $\epsilon$
accuracy [54]. Consider a finite but unknown program $p$ computing $y \in B^\infty$.
What if Postulate 1 holds but $p$ is not optimally efficient, and/or computed on
a computer that differs from our reference machine? Then we effectively do not
sample beginnings $y_k$ from $S$ but from an alternative semimeasure $S'$. Can we
still predict well? Yes, because the Speed Prior $S$ dominates $S'$. This dominance
is all we need to apply the recent loss bounds [2T). The loss that we are expected
to receive by predicting according to AS instead of using the true but unknown
$S'$ does not exceed the optimal loss by much [54].

6 Speed Prior-Based Predictions for Our Universe

"In the beginning was the code."
First sentence of the Great Programmer's Bible 

Physicists and economists and other inductive scientists make predictions 
based on observations. Astonishingly, however, few physicists are aware of the 
theory of optimal inductive inference [62, 28]. In fact, when talking about the very
nature of their inductive business, many physicists cite rather vague concepts 
such as Popper's falsifiability instead of referring to quantitative results. 

All widely accepted physical theories, however, are accepted not because they 
are falsifiable — they are not — or because they match the data — many alternative 
theories also match the data — but because they are simple in a certain sense. For 
example, the theory of gravitation is induced from locally observable training 
examples such as falling apples and movements of distant light sources, presum- 
ably stars. The theory predicts that apples on distant planets in other galaxies 
will fall as well. Currently nobody is able to verify or falsify this. But everybody 
believes in it because this generalization step makes the theory simpler than al- 
ternative theories with separate laws for apples on other planets. The same holds 
for superstring theory ^] or Everett's many world theory which presently 
also are neither verifiable nor falsifiable, yet offer comparatively simple explana- 
tions of numerous observations. In particular, most of Everett's postulated many 
worlds will remain unobservable forever, but the assumption of their existence 
simplifies the theory, thus making it more beautiful and acceptable. 

In Sections 3 and 4 we have made the assumption that the probabilities
of next events, given previous events, are (limit-)computable. Here we make a
stronger assumption by adopting Zuse's thesis [75, 76], namely, that the very
universe is actually being computed deterministically, e.g., on a cellular
automaton (CA) [68, 70]. Quantum physics, quantum computation |8I1()I88|,
Heisenberg's uncertainty principle and Bell's inequality [2] do not imply any physical
evidence against this possibility, e.g., [5^ .

But then which is our universe's precise algorithm? The following method 
0H1 does compute it: 



Systematically create and execute all programs for a universal computer, 
such as a Turing machine or a CA; the first program is run for one 
instruction every second step on average, the next for one instruction 
every second of the remaining steps on average, and so on. 

This method in a certain sense implements the simplest theory of everything: all 
computable universes, including ours and ourselves as observers, are computed 
by the very short program that generates and executes all possible programs 
[48]. In nested fashion, some of these programs will execute processes that again
compute all possible universes, etc. Of course, observers in "higher-level" 
universes may be completely unaware of observers or universes computed by 
nested processes, and vice versa. For example, it seems hard to track and inter- 
pret the computations performed by a cup of tea. 

The simple method above is more efficient than it may seem at first glance. 
A bit of thought shows that it even has the optimal order of complexity. For 
example, it outputs our universe history as quickly as this history's fastest pro- 
gram, save for a (possibly huge) constant slowdown factor that does not depend 
on output size. 
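The time allocation of the method above can be sketched as a schedule over program indices. The source only says "on average"; the deterministic schedule below, where step s goes to the program whose index equals the number of trailing zero bits of s, is one assumed way to realize those averages, and `dovetail_schedule` is a hypothetical name.

```python
from itertools import count, islice

def dovetail_schedule():
    """Yield, for each global step s = 1, 2, 3, ..., the index of the program
    that executes one instruction at that step."""
    for s in count(1):
        # (s & -s) isolates the lowest set bit; its position is the number
        # of trailing zero bits of s.
        yield (s & -s).bit_length() - 1

first_steps = list(islice(dovetail_schedule(), 8))
```

Under this schedule program i receives a fraction 2^-(i+1) of all steps, so it is slowed down by the factor 2^(i+1). That factor depends only on the program's position in the enumeration and not on how much output it produces, which is the constant slowdown mentioned above.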

Nevertheless, some universes are fundamentally harder to compute than others.
This is reflected by the Speed Prior S discussed above (Section 5). So let
us assume that our universe's history is sampled from S or a less dominant
prior reflecting suboptimal computation of the history. Now we can immediately
predict:

1. Our universe will not get many times older than it is now [50]; essentially,
the probability that it will last $2^n$ times longer than it has lasted so far is at
most $2^{-n}$.

2. Any apparent randomness in any physical observation must be due to
some yet unknown but fast pseudo-random generator PRG |5U| which we should
try to discover.
2a. A re-examination of beta decay patterns may reveal that a
very simple, fast, but maybe not quite trivial PRG is responsible for the
apparently random decays of neutrons into protons, electrons and antineutrinos.
2b. Whenever there are several possible continuations of our universe
corresponding to different Schrödinger wave function collapses (compare Everett's widely
accepted many worlds hypothesis [12]), we should be more likely to end up in
one computable by a short and fast algorithm. A re-examination of split
experiment data involving entangled states such as the observations of spins of initially
close but soon distant particles with correlated spins might reveal unexpected,
nonobvious, nonlocal algorithmic regularity due to a fast PRG.

3. Large scale quantum computation |S] will not work well, essentially
because it would require too many exponentially growing computational resources
in interfering "parallel universes" [12].

4. Any probabilistic algorithm depending on truly random inputs from the
environment will not scale well in practice.

Prediction 2 is verifiable but not necessarily falsifiable within a fixed time 
interval given in advance. Still, perhaps the main reason for the current absence 
of empirical evidence in this vein is that few have looked for it. 



In recent decades several well-known physicists have started writing about 
topics of computer science, e.g., [38, 10], sometimes suggesting that real world
physics might allow for computing things that are not computable traditionally. 
Unimpressed by this trend, computer scientists have argued in favor of the oppo- 
site: since there is no evidence that we need more than traditional computability 
to explain the world, we should try to make do without this assumption, e.g., 
[75.76.13115) . 



7 Optimal Rational Decision Makers 

So far we have talked about passive prediction, given the observations. Note, 
however, that agents interacting with an environment can also use predictions 
of the future to compute action sequences that maximize expected future re- 
ward. Hutter's recent AIXI model [21] (author's SNF grant 61847) does exactly 
this, by combining Solomonoff's M-based universal prediction scheme with an 
expectimax computation. 

In cycle $t$ action $y_t$ results in perception $x_t$ and reward $r_t$, where all
quantities may depend on the complete history. The perception $x_t$ and reward $r_t$ are
sampled from the (reactive) environmental probability distribution $\mu$. Sequential
decision theory shows how to maximize the total expected reward, called value,
if $\mu$ is known. Reinforcement learning [27] is used if $\mu$ is unknown. AIXI defines
a mixture distribution $\xi$ as a weighted sum of distributions $\nu \in \mathcal{M}$, where $\mathcal{M}$ is
any class of distributions including the true environment $\mu$.
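The combination of a mixture distribution with an expectimax computation can be sketched for a tiny finite class. The sketch assumes stateless environments with binary rewards, where `envs[k][a]` is the probability of reward 1 for action `a` in environment `k`, and a fixed finite horizon. All of these are simplifying assumptions for illustration; AIXI itself mixes over a far larger class and conditions on the full history.

```python
def value(weights, envs, horizon):
    """Return (expected total reward, first action) of the Bayes-optimal
    policy with respect to the mixture defined by `weights` over `envs`."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in range(len(envs[0])):
        v = 0.0
        for r in (0, 1):
            # Mixture probability of reward r given action a.
            like = [p[a] if r == 1 else 1.0 - p[a] for p in envs]
            xi = sum(w * l for w, l in zip(weights, like))
            if xi == 0.0:
                continue
            # Bayes update of the weights, then recurse on the shorter horizon.
            posterior = [w * l / xi for w, l in zip(weights, like)]
            v += xi * (r + value(posterior, envs, horizon - 1)[0])
        if v > best_value:
            best_value, best_action = v, a
    return best_value, best_action

# Hypothetical class: action 0 pays off in the first environment only,
# action 1 in the second only.
envs = [[1.0, 0.0], [0.0, 1.0]]
two_step_value, first_action = value([0.5, 0.5], envs, 2)
```

With two steps the agent can afford one exploratory action, since a single observed reward identifies the environment. The recursion is exponential in the horizon, which hints at why the unrestricted model is infeasible.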

It can be shown that the conditional M probability of environmental inputs
to an AIXI agent, given the agent's earlier inputs and actions, converges with
increasing length of interaction to the true, unknown probability [22], as
long as the latter is recursively computable, analogously to the passive prediction
case.

Recent work [2J also demonstrated AIXI's optimality in the following sense.
The Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing in the
sense that the average value converges asymptotically for all $\mu \in \mathcal{M}$ to the
optimal value achieved by the (infeasible) Bayes-optimal policy which knows
$\mu$ in advance. The necessary condition that $\mathcal{M}$ admits self-optimizing policies is
also sufficient. No other structural assumptions are made on $\mathcal{M}$. Furthermore,
$p^\xi$ is Pareto-optimal in the sense that there is no other policy yielding higher or
equal value in all environments $\nu \in \mathcal{M}$ and a strictly higher value in at least one.

We can modify the AIXI model such that its predictions are based on the
$\epsilon$-approximable Speed Prior $S$ instead of the incomputable $M$. Thus we obtain
the so-called AIS model. Using Hutter's approach [22] we can now show that
the conditional $S$ probability of environmental inputs to an AIS agent, given
the earlier inputs and actions, converges to the true but unknown probability,
as long as the latter is dominated by $S$, such as the $S'$ above.



8 Optimal Universal Search Algorithms 

In a sense, searching is less general than reinforcement learning because it does 
not necessarily involve predictions of unseen data. Still, search is a central
aspect of computer science (and any reinforcement learner needs a searcher as
a submodule; see Sections 10 and 11). Surprisingly, however, many books on
search algorithms do not even mention the following, very simple asymptotically 
optimal, "universal" algorithm for a broad class of search problems. 

Define a probability distribution P on a finite or infinite set of programs for 
a given computer. P represents the searcher's initial bias (e.g., P could be based 
on program length, or on a probabilistic syntax diagram). 

Method LSEARCH: Set current time limit T = 1. While problem not
solved DO:

Test all programs q such that t(q), the maximal time spent on
creating and running and testing q, satisfies t(q) < P(q) T. Set
T := 2T.

LSEARCH (for Levin Search) may be the algorithm Levin was referring to in his
2 page paper [29] which states that there is an asymptotically optimal universal
search method for problems with easily verifiable solutions, that is, solutions
whose validity can be quickly tested. Given some problem class, if some unknown
optimal program $p$ requires $f(k)$ steps to solve a problem instance of size $k$, then
LSEARCH will need at most $O(f(k)/P(p)) = O(f(k))$ steps; the constant factor
$1/P(p)$ may be huge but does not depend on $k$. Compare |^ p. 502-505] and
[23] and the fastest way of computing all computable universes in Section 6.
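Method LSEARCH can be sketched as follows. The interface is hypothetical: `run(q, budget)` is assumed to return program q's output, or `None` if q needs more than `budget` steps, and each phase simply restarts every program from scratch. The toy programs, their prior, and their step costs are invented for illustration, and the phase cap `max_phases` is added only so that the sketch always terminates.

```python
def lsearch(programs, prior, run, is_solution, max_phases=40):
    """Return (program, output, T) for the first verified solution, or None."""
    T = 1
    for _ in range(max_phases):
        for q in programs:
            # Program q may use at most a P(q) share of the current limit T.
            budget = int(prior[q] * T)
            if budget < 1:
                continue
            out = run(q, budget)
            if out is not None and is_solution(out):
                return q, out, T
        T *= 2
    return None

# Hypothetical toy setting: program q "computes" q*q and needs 3*q steps.
toy_programs = range(1, 11)
toy_prior = {q: 2.0 ** -q for q in toy_programs}

def toy_run(q, budget):
    return q * q if budget >= 3 * q else None

def toy_is_solution(out):
    return out == 49

result = lsearch(toy_programs, toy_prior, toy_run, toy_is_solution)
```

In the toy run the solving program has prior 2^-7 and needs 21 steps, so it first fits the budget once T reaches 4096. Since T doubles in every phase, the total effort stays within a constant multiple of the runtime divided by the prior, which is the bound stated above.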

Recently Hutter developed a more complex asymptotically optimal search
algorithm for all well-defined problems, not just those with easily verifiable
solutions [22]. HSEARCH cleverly allocates part of the total search time for
searching the space of proofs to find provably correct candidate programs with
provable upper runtime bounds, and at any given time focuses resources on those
programs with the currently best proven time bounds. Unexpectedly, HSEARCH
manages to reduce the unknown constant slowdown factor of LSEARCH to a value
of $1 + \epsilon$, where $\epsilon$ is an arbitrary positive constant.

Unfortunately, however, the search in proof space introduces an unknown 
additive problem class-specific constant slowdown, which again may be huge. 
While additive constants generally are preferable to multiplicative ones, both
types may make universal search methods practically infeasible. 

HSEARCH and LSEARCH are nonincremental in the sense that they do not 
attempt to minimize their constants by exploiting experience collected in previous 
searches. Our method Adaptive LSEARCH or ALS tries to overcome this [60] 
(compare Solomonoff's related ideas [64, 65]). Essentially it works as follows: 
whenever LSEARCH finds a program q that computes a solution for the current 
problem, q's probability P(q) is substantially increased using a "learning rate," 
while probabilities of alternative programs decrease appropriately. Subsequent 
LSEARCHes for new problems then use the adjusted P, etc. A nonuniversal variant 
of this approach was able to solve reinforcement learning (RL) tasks [23] 
in partially observable environments unsolvable by traditional RL algorithms. 
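
One way such a probability shift could look is sketched below. The update rule is an assumption chosen for illustration (a token-level distribution in which each token used by the solution is pulled toward probability 1 at rate `lr` while the others shrink by the factor 1 - `lr`); the text above does not specify the exact rule.

```python
def als_update(token_probs, solution_tokens, lr=0.1):
    """Shift probability mass toward the tokens of a successful program.

    Hypothetical rule: for each token of the solution, move its probability
    a fraction `lr` of the way toward 1 and scale all others by (1 - lr).
    Each such step keeps the distribution normalized.
    """
    probs = dict(token_probs)
    for used in solution_tokens:
        for tok in probs:
            if tok == used:
                probs[tok] += lr * (1.0 - probs[tok])
            else:
                probs[tok] *= 1.0 - lr
    return probs
```

A later search that multiplies token probabilities to obtain P(q) would then reach programs resembling earlier solutions sooner.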

Each LSEARCH invoked by Als is optimal with respect to the most recent 
adjustment of P. On the other hand, the modifications of P themselves are not 
necessarily optimal. Recent work discussed in the next section overcomes this 
drawback in a principled way. 

9 Optimal Ordered Problem Solver (OOPS) 

Our recent OOPS [53, 55] is a simple, general, theoretically sound, in a certain 
sense time-optimal way of searching for a universal behavior or program that 
solves each problem in a sequence of computational problems, continually organizing 
and managing and reusing earlier acquired knowledge. For example, the 
n-th problem may be to compute the n-th event from previous events (prediction), 
or to find a faster way through a maze than the one found during the 
search for a solution to the (n-1)-th problem (optimization). 

Let us first introduce the important concept of bias-optimality, which is a 
pragmatic definition of time-optimality, as opposed to the asymptotic optimality 
of both LSEARCH and HSEARCH, which may be viewed as academic exercises 
demonstrating that the O() notation can sometimes be practically irrelevant despite 
its wide use in theoretical computer science. Unlike asymptotic optimality, 
bias-optimality does not ignore huge constant slowdowns: 

Definition 1 (Bias-Optimal Searchers). Given is a problem class R, a 
search space C of solution candidates (where any problem r ∈ R should have 
a solution in C), a task-dependent bias in form of conditional probability distributions 
P(q | r) on the candidates q ∈ C, and a predefined procedure that creates 
and tests any given q on any r ∈ R within time t(q, r) (typically unknown in 
advance). A searcher is n-bias-optimal (n ≥ 1) if for any maximal total search 
time T_max > 0 it is guaranteed to solve any problem r ∈ R if it has a solution 
p ∈ C satisfying t(p, r) < P(p | r) T_max / n. It is bias-optimal if n = 1. 
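
As a small worked illustration of Definition 1, the condition t(p, r) < P(p | r) T_max / n can be rearranged to T_max > n t(p, r) / P(p | r). The helper below (its name and the example numbers are illustrative assumptions) computes that threshold.

```python
def search_time_threshold(t, p, n=1.0):
    """Return n * t / p: an n-bias-optimal searcher is guaranteed to solve the
    problem for any total search time T_max strictly above this value."""
    if not (0.0 < p <= 1.0) or n < 1.0 or t < 0.0:
        raise ValueError("need 0 < p <= 1, n >= 1 and t >= 0")
    return n * t / p


# Example: a solution needing 1000 steps with bias P(p | r) = 1/64.
threshold = search_time_threshold(1000, 1.0 / 64)        # 64000.0 for n = 1
threshold_8 = search_time_threshold(1000, 1.0 / 64, 8)   # 8 times larger for n = 8
```

The example shows why the bias matters in practice: halving P(p | r) doubles the total time needed before the guarantee applies.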

This definition makes intuitive sense: the most probable candidates should get 
the lion's share of the total search time, in a way that precisely reflects the initial 
bias. Now we are ready to provide a general overview of the basic ingredients of 
OOPS [53]: 

Primitives. We start with an initial set of user-defined primitive behaviors. 
Primitives may be assembler-like instructions or time-consuming software, such 
as, say, theorem provers, or matrix operators for neural network-like parallel 
architectures, or trajectory generators for robot simulations, or state update 
procedures for multiagent systems, etc. Each primitive is represented by a token. 
It is essential that those primitives whose runtimes are not known in advance 
can be interrupted at any time. 

Task-specific prefix codes. Complex behaviors are represented by token sequences 
or programs. To solve a given task represented by task-specific program 
inputs, OOPS tries to sequentially compose an appropriate complex behavior from 
primitive ones, always obeying the rules of a given user-defined initial programming 
language. Programs are grown incrementally, token by token; their beginnings 
or prefixes are immediately executed while being created; this may modify 
some task-specific internal state or memory, and may transfer control back to 
previously selected tokens (e.g., loops). To add a new token to some program prefix, 
we first have to wait until the execution of the prefix so far explicitly requests 
such a prolongation, by setting an appropriate signal in the internal state. Prefixes 
that cease to request any further tokens are called self-delimiting programs 
or simply programs (programs are their own prefixes). Binary self-delimiting 
programs were studied by [30] and [8] in the context of Turing machines [67] 
and the theory of Kolmogorov complexity and algorithmic probability [62, 28]. 
OOPS, however, uses a more practical, not necessarily binary framework. 

The program construction procedure above yields task-specific prefix codes on 
program space: with any given task, programs that halt because they have found 
a solution or encountered some error cannot request any more tokens. Given the 
current task-specific inputs, no program can be the prefix of another one. On a 
different task, however, the same program may continue to request additional 
tokens. This is important for our novel approach — incrementally growing self- 
delimiting programs are unnecessary for the asymptotic optimality properties of 
LSEARCH and Hsearch, but essential for OOPS. 
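
The prefix-code property can be checked on a toy example. The sketch below assumes a hypothetical three-token language (`inc`, `dbl`, `halt`) and treats values above 20 as an error. A prefix is extended only when its execution runs off the end, which stands in for "requesting" another token.

```python
TOKENS = ("inc", "dbl", "halt")  # hypothetical toy language, not the paper's


def execute(prefix, task_input):
    """Run a prefix on a task input; return (state, wants_more_tokens)."""
    x = task_input
    for tok in prefix:
        if tok == "halt":
            return x, False          # halted: no further tokens requested
        x = x + 1 if tok == "inc" else x * 2
        if x > 20:
            return None, False       # error: no further tokens requested
    return x, True                   # ran off the end: request another token


def grow(task_input, max_len):
    """Collect all self-delimiting programs of length <= max_len for one task."""
    programs, frontier = [], [()]
    while frontier:
        prefix = frontier.pop()
        _, wants_more = execute(prefix, task_input)
        if not wants_more:
            programs.append(prefix)
        elif len(prefix) < max_len:
            frontier.extend(prefix + (tok,) for tok in TOKENS)
    return programs


def is_prefix_free(programs):
    """True if no program is a proper prefix of another one."""
    known = set(programs)
    return not any(p[:i] in known for p in programs for i in range(len(p)))
```

With input 1 the sequence `("dbl",)` still requests tokens, while with input 15 it ends in an error and therefore counts as a complete program. This is the task dependence described above.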

Access to previous solutions. Let p^n denote a found prefix solving the first 
n tasks. The search for p^{n+1} may greatly profit from the information conveyed 
by (or the knowledge embodied by) p^1, p^2, ..., p^n, which are stored or frozen in 
special nonmodifiable memory shared by all tasks, such that they are accessible 
to p^{n+1} (this is another difference to nonincremental LSEARCH and HSEARCH). 
For example, p^{n+1} might execute a token sequence that calls p^{n-1} as a subprogram, 
or that copies some earlier p^i into some internal modifiable task-specific memory, 
then modifies the copy a bit, then applies the slightly edited copy to the current 
task. In fact, since the number of frozen programs may grow to a large value, 
much of the knowledge embodied by p^j may be about how to access and edit 
and use older p^i (i < j). 

Bias. The searcher's initial bias is embodied by initial, user-defined, task de- 
pendent probability distributions on the finite or infinite search space of pos- 
sible program prefixes. In the simplest case we start with a maximum entropy 
distribution on the tokens, and define prefix probabilities as the products of 
the probabilities of their tokens. But prefix continuation probabilities may also 
depend on previous tokens in context sensitive fashion. 

Self-computed suffix probabilities. In fact, we permit that any executed prefix 
assigns a task-dependent, self-computed probability distribution to its own 
possible continuations. This distribution is encoded and manipulated in task-specific 
internal memory. So unlike with ALS [60] we do not use a prewired 
learning scheme to update the probability distribution. Instead we leave such 
updates to prefixes whose online execution modifies the probabilities of their 
suffixes. By, say, invoking previously frozen code that redefines the probability 
distribution on future prefix continuations, the currently tested prefix may 
completely reshape the most likely paths through the search space of its own 
continuations, based on experience ignored by nonincremental LSEARCH and 
HSEARCH. This may introduce significant problem class-specific knowledge derived 
from solutions to earlier tasks. 

Two searches. Essentially, OOPS provides equal resources for two near-bias-optimal 
searches (Def. 1) that run in parallel until p^{n+1} is discovered and stored 
in non-modifiable memory. The first is exhaustive; it systematically tests all 
possible prefixes on all tasks up to n + 1. Alternative prefixes are tested on all 
current tasks in parallel while still growing; once a task is solved, we remove it 
from the current set; prefixes that fail on a single task are discarded. The second 
search is much more focused; it only searches for prefixes that start with p^n, and 
only tests them on task n + 1, which is safe, because we already know that such 
prefixes solve all tasks up to n. 

Bias-optimal backtracking. Hsearch and Lsearch assume potentially infi- 
nite storage. Hence they may largely ignore questions of storage management. In 
any practical system, however, we have to efficiently reuse limited storage. There- 
fore, in both searches of OOPS, alternative prefix continuations are evaluated by 
a novel, practical, token-oriented backtracking procedure that can deal with sev- 
eral tasks in parallel, given some code bias in the form of previously found code. 
The procedure always ensures near-bias-optimality (Def. 1): no candidate behavior 
gets more time than it deserves, given the probabilistic bias. Essentially we 
conduct a depth-first search in program space, where the branches of the search 
tree are program prefixes, and backtracking (partial resets of partially solved 
task sets and modifications of internal states and continuation probabilities) is 
triggered once the sum of the runtimes of the current prefix on all current tasks 
exceeds the prefix probability multiplied by the total search time so far. 
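
The backtracking criterion can be sketched for the simplest case. The code below is a heavily simplified illustration under stated assumptions: a single task, a hypothetical two-instruction language with fixed token probabilities of 0.5 each, one time step per token, and no state resets or self-modifying probabilities. It only shows the depth-first pattern in which a branch is abandoned once its runtime would exceed its prefix probability times the current time limit.

```python
# Hypothetical toy language and bias (assumptions for illustration only).
OPS = {"inc": lambda x: x + 1, "dbl": lambda x: x * 2}
P_TOK = {"inc": 0.5, "dbl": 0.5}


def dfs(state, prefix, prefix_prob, runtime, T, target):
    """Depth-first search over prefixes with probability-weighted time limits."""
    if state == target:
        return prefix
    for tok, op in OPS.items():
        p = prefix_prob * P_TOK[tok]
        if runtime + 1 > p * T:
            continue  # backtrack: this prefix would exceed its share P(prefix) * T
        found = dfs(op(state), prefix + (tok,), p, runtime + 1, T, target)
        if found is not None:
            return found
    return None


def search(target, start=1, max_phases=30):
    """Repeat the depth-first search with doubled total time limits."""
    T = 1.0
    for _ in range(max_phases):
        found = dfs(start, (), 1.0, 0, T, target)
        if found is not None:
            return found, T
        T *= 2
    return None
```

Because each added token halves the prefix probability while the runtime grows, every phase explores only a finite tree, and storage is reused on the way back up.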

In case of unknown, infinite task sequences we can typically never know 
whether we already have found an optimal solver for all tasks in the sequence. 
But once we unwittingly do find one, at most half of the total future run time will 
be wasted on searching for alternatives. Given the initial bias and subsequent bias 
shifts due to p^1, p^2, ..., no other bias-optimal searcher can expect to solve the 
(n+1)-th task set substantially faster than OOPS. A by-product of this optimality 
property is that it gives us a natural and precise measure of bias and bias shifts, 
conceptually related to Solomonoff's conceptual jump size of [64, 65]. 

Since there is no fundamental difference between domain-specific problem- 
solving programs and programs that manipulate probability distributions and 
thus essentially rewrite the search procedure itself, we collapse both learning and 
metalearning in the same time-optimal framework. 

An example initial language. For an illustrative application, we wrote an in- 
terpreter for a stack-based universal programming language inspired by Forth 
[85], with initial primitives for defining and calling recursive functions, iterative 
loops, arithmetic operations, and domain-specific behavior. Optimal metasearch- 
ing for better search algorithms is enabled through the inclusion of bias-shifting 
instructions that can modify the conditional probabilities of future search op- 
tions in currently running program prefixes. 



Experiments. Using the assembler-like language mentioned above, we first 
teach OOPS something about recursion, by training it to construct samples of the 
simple context free language {1^k 2^k} (k 1's followed by k 2's), for k up to 30 (in 
fact, the system discovers a universal solver for all k). This takes roughly 0.3 days 
on a standard personal computer (PC). Thereafter, within a few additional days, 
OOPS demonstrates incremental knowledge transfer: it exploits aspects of its previously 
discovered universal 1^k 2^k-solver, by rewriting its search procedure such 
that it more readily discovers a universal solver for all k disk Towers of Hanoi 
problems. In the experiments it solves all instances up to k = 30 (solution size 
2^k - 1), but it would also work for k > 30. Previous, less general reinforcement 
learners and nonlearning AI planners tend to fail for much smaller instances. 
Future research may focus on devising particularly compact, particularly reasonable 
sets of initial codes with particularly broad practical applicability. It may 
turn out that the most useful initial languages are not traditional programming 
languages similar to the FORTH-like one, but instead based on a handful of primitive 
instructions for massively parallel cellular automata [68, 70, 76], or on a few 
nonlinear operations on matrix-like data structures such as those used in recurrent 
neural network research [72, 44, 4]. For example, we could use the principles 
of OOPS to create a non-gradient-based, near-bias-optimal variant of Hochreiter's 
successful recurrent network metalearner [?]. It should also be of interest 
to study probabilistic Speed Prior-based OOPS variants and to devise applications 
of OOPS-like methods as components of universal reinforcement learners 
(see below). In ongoing work, we are applying OOPS to the problem of optimal 
trajectory planning for robotics in a realistic physics simulation. This involves 
the interesting trade-off between comparatively fast program-composing primitives 
or "thinking primitives" and time-consuming "action primitives", such as 
stretch-arm-until-touch-sensor-input. 
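
For reference, the two task families used in the experiments can be written by hand as short recursive or closed-form programs. The sketch below is not the code discovered by OOPS; it only makes the targets concrete, including the solution size of 2^k - 1 moves for k disks. Peg names are arbitrary.

```python
def ones_twos(k):
    """Sample of the context free language {1^k 2^k}: k 1's followed by k 2's."""
    return "1" * k + "2" * k


def hanoi(k, src="A", aux="B", dst="C"):
    """Return the list of moves (from_peg, to_peg) solving k-disk Towers of Hanoi."""
    if k == 0:
        return []
    return (
        hanoi(k - 1, src, dst, aux)      # move k-1 disks out of the way
        + [(src, dst)]                   # move the largest disk
        + hanoi(k - 1, aux, src, dst)    # move the k-1 disks on top of it
    )
```

The number of moves doubles (plus one) with each added disk, so the k = 30 instance already requires more than a billion moves, which is why a compact universal solver is needed rather than a stored move list.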



10 OOPS-Based Reinforcement Learning 

At any given time, a reinforcement learner [27] will try to find a policy (a strategy 
for future decision making) that maximizes its expected future reward. In many 
traditional reinforcement learning (RL) applications, the policy that works best 
in a given set of training trials will also be optimal in future test trials [51]. 
Sometimes, however, it won't. To see the difference between searching (the topic 
of the previous sections) and reinforcement learning (RL), consider an agent 
and two boxes. In the n-th trial the agent may open and collect the content of 
exactly one box. The left box will contain 100n Swiss Francs, the right box 2^n 
Swiss Francs, but the agent does not know this in advance. During the first 9 
trials the optimal policy is "open left box." This is what a good searcher should 
find, given the outcomes of the first 9 trials. But this policy will be suboptimal in 
trial 10. A good reinforcement learner, however, should extract the underlying 
regularity in the reward generation process and predict the future tasks and 
rewards, picking the right box in trial 10, without having seen it yet. 
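
The arithmetic behind the two-box example is easy to verify. The short sketch below (function names are illustrative) tabulates both payoffs and finds the first trial in which the right box pays more.

```python
def left_box(n):
    """Content of the left box in trial n (Swiss Francs)."""
    return 100 * n


def right_box(n):
    """Content of the right box in trial n (Swiss Francs)."""
    return 2 ** n


def first_trial_right_is_better(max_trials=100):
    """First trial n in which the right box holds more than the left one."""
    return next(n for n in range(1, max_trials + 1) if right_box(n) > left_box(n))


# Trial 9: left 900 vs right 512; trial 10: left 1000 vs right 1024.
payoffs = [(n, left_box(n), right_box(n)) for n in (9, 10)]
```

A policy fitted only to trials 1 through 9 therefore keeps opening the left box, while a learner that models the exponential growth switches at trial 10.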



The first general, asymptotically optimal reinforcement learner is the recent 
AIXI model [22, 24] (Section ?). It is valid for a very broad class of environments 
whose reactions to action sequences (control signals) are sampled from arbitrary 
computable probability distributions. This means that AIXI is far more general 
than traditional RL approaches. However, while AIXI clarifies the theoretical 
limits of RL, it is not practically feasible, just like Hsearch is not. From a 
pragmatic point of view, what we are really interested in is a reinforcement 
learner that makes optimal use of given, limited computational resources. In 
what follows, we will outline one way of using OOPS-like bias-optimal methods 
as components of general yet feasible reinforcement learners. 

We need two OOPS modules. The first is called the predictor or world model. 
The second is an action searcher using the world model. The life of the entire 
system should consist of a sequence of cycles 1, 2, ... At each cycle, a limited 
amount of computation time will be available to each module. For simplicity we 
assume that during each cycle the system may take exactly one action. Generalizations 
to actions consuming several cycles are straightforward though. At any 
given cycle, the system executes the following procedure: 

1. For a time interval fixed in advance, the predictor is first trained in bias- 
optimal fashion to find a better world model, that is, a program that predicts 
the inputs from the environment (including the rewards, if there are any), 
given a history of previous observations and actions. So the n-th task (n = 
1, 2, ...) of the first OOPS module is to find (if possible) a better predictor 
than the best found so far. 

2. Once the current cycle's time for predictor improvement is used up, the 
current world model (prediction program) found by the first OOPS module 
will be used by the second module, again in bias-optimal fashion, to search for 
a future action sequence that maximizes the predicted cumulative reward (up 
to some time limit). That is, the n-th task (n = 1, 2, . . .) of the second OOPS 
module will be to find a control program that computes a control sequence 
of actions, to be fed into the program representing the current world model 
(whose input predictions are successively fed back to itself in the obvious 
manner), such that this control sequence leads to higher predicted reward 
than the one generated by the best control program found so far. 

3. Once the current cycle's time for control program search is used up, we will 
execute the current action of the best control program found in step 2. Now 
we are ready for the next cycle. 
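
The three steps above can be sketched as a control loop. Everything below is a stand-in under stated assumptions: both "modules" are trivial exhaustive searches over tiny hypothetical candidate sets rather than OOPS, the budget counts candidates instead of time, and rewards are the only inputs being predicted. Only the order of operations per cycle reflects the procedure.

```python
import itertools


def improve_predictor(model, history, budget):
    """Stand-in for module 1: return a model no worse than `model` on the history."""
    candidates = [lambda h, a: 0.0, lambda h, a: float(a)]  # hypothetical model space

    def error(m):
        return sum((m(history[:i], a) - r) ** 2 for i, (a, r) in enumerate(history))

    best = model
    for m in candidates[:budget]:
        if error(m) < error(best):
            best = m
    return best


def search_actions(model, history, budget, horizon=3, actions=(1, 0)):
    """Stand-in for module 2: action sequence with highest predicted total reward."""
    plans = itertools.islice(itertools.product(actions, repeat=horizon), budget)
    return max(plans, key=lambda plan: sum(model(history, a) for a in plan))


def run_cycles(env_step, n_cycles, budget=8):
    """One action per cycle: improve the model, plan with it, act, record."""
    history, model = [], (lambda h, a: 0.0)
    for _ in range(n_cycles):
        model = improve_predictor(model, history, budget)   # step 1
        plan = search_actions(model, history, budget)       # step 2
        action = plan[0]                                    # step 3
        history.append((action, env_step(action)))
    return history
```

For instance, `run_cycles(lambda a: float(a), 4)` runs four cycles in an environment whose reward equals the chosen action.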

The approach is reminiscent of an earlier, heuristic, non-bias-optimal RL ap- 
proach based on two adaptive recurrent neural networks, one representing the 
world model, the other one a controller that uses the world model to extract a 
policy for maximizing expected reward [46]. The method was inspired by previous 
combinations of nonrecurrent, reactive world models and controllers [73, 37, 26]. 

At any given time, until which temporal horizon should the predictor try to 
predict? In the AIXI case, the proper way of treating the temporal horizon is not 
to discount it exponentially, as done in most traditional work on reinforcement 
learning, but to let the future horizon grow in proportion to the learner's lifetime 
so far [23]. It remains to be seen whether this insight carries over to OOPS-RL. 

Despite the bias-optimality properties of OOPS for certain ordered task sequences, 
however, OOPS-RL is not necessarily the best way of spending limited 
time in general reinforcement learning situations. On the other hand, it is possible 
to use OOPS as a proof-searching submodule of the recent, optimal, universal, 
reinforcement learning Gödel machine [?] discussed in the next section. 

11 The Gödel Machine 

The Gödel machine [56] explicitly addresses the 'Grand Problem of Artificial 
Intelligence' [58] by optimally dealing with limited resources in general reinforcement 
learning settings, and with the possibly huge (but constant) slowdowns 
buried by AIXI(t, l) [?] in the somewhat misleading O()-notation. It is 
designed to solve arbitrary computational problems beyond those solvable by 
plain OOPS, such as maximizing the expected future reward of a robot in a possibly 
stochastic and reactive environment (note that the total utility of some 
robot behavior may be hard to verify, since its evaluation may consume the robot's 
entire lifetime). 

How does it work? While executing some arbitrary initial problem solving 
strategy, the Gödel machine simultaneously runs a proof searcher which systematically 
and repeatedly tests proof techniques. Proof techniques are programs 
that may read any part of the Gödel machine's state, and write on a reserved 
part which may be reset for each new proof technique test. In an example Gödel 
machine [56] this writable storage includes the variables proof and switchprog, 
where switchprog holds a potentially unrestricted program whose execution could 
completely rewrite any part of the Gödel machine's current software. Normally 
the current switchprog is not executed. However, proof techniques may invoke 
a special subroutine check() which tests whether proof currently holds a proof 
showing that the utility of stopping the systematic proof searcher and transferring 
control to the current switchprog at a particular point in the near future 
exceeds the utility of continuing the search until some alternative switchprog is 
found. Such proofs are derivable from the proof searcher's axiom scheme which 
formally describes the utility function to be maximized (typically the expected 
future reward in the expected remaining lifetime of the Gödel machine), the 
computational costs of hardware instructions (from which all programs are composed), 
and the effects of hardware instructions on the Gödel machine's state. 
The axiom scheme also formalizes known probabilistic properties of the possibly 
reactive environment, and also the initial Gödel machine state and software, 
which includes the axiom scheme itself (no circular argument here). Thus proof 
techniques can reason about expected costs and results of all programs including 
the proof searcher. 

Once check() has identified a provably good switchprog, the latter is executed 
(some care has to be taken here because the proof verification itself and 
the transfer of control to switchprog also consume part of the typically limited 
lifetime). The discovered switchprog represents a globally optimal self-change in 
the following sense: provably none of all the alternative switchprogs and proofs 
(that could be found in the future by continuing the proof search) is worth 
waiting for. 
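
The control flow described in this section can be outlined as a bare skeleton. This is only a structural sketch under strong assumptions: proof techniques and the verifier are opaque callables supplied by the caller, the names `scratch` and `verify` are hypothetical, and nothing here models the axiom scheme, utilities, or the cost of verification itself.

```python
def godel_machine_loop(solver_step, proof_techniques, verify, max_steps):
    """Interleave the initial solver with tests of proof techniques.

    Returns the first switchprog whose accompanying proof is accepted by
    `verify`, or None if none is found within `max_steps` iterations.
    """
    for step in range(max_steps):
        solver_step()  # keep running the initial problem solving strategy

        # The reserved writable storage is reset for each proof technique test.
        scratch = {"proof": None, "switchprog": None}
        technique = proof_techniques[step % len(proof_techniques)]
        technique(scratch)

        # Stand-in for check(): is there a verified proof that switching to
        # switchprog now is better than continuing the search?
        if scratch["switchprog"] is not None and verify(
            scratch["proof"], scratch["switchprog"]
        ):
            return scratch["switchprog"]  # control would be transferred here
    return None
```

In a real Gödel machine the returned switchprog could rewrite any part of the system, including this loop; the skeleton merely returns it to the caller.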

There are many ways of initializing the proof searcher. Although identical 
proof techniques may yield different proofs depending on the time of their invocation 
(due to the continually changing Gödel machine state), there is a bias-optimal 
and asymptotically optimal proof searcher initialization based on a variant 
of OOPS [?] (Section 9). It exploits the fact that proof verification is a simple 
and fast business where the particular optimality notion of OOPS is appropriate. 
The Gödel machine itself, however, may have an arbitrary, typically different and 
more powerful sense of optimality embodied by its given utility function. 

12 Conclusion 

Recent theoretical and practical advances are currently driving a renaissance 
in the fields of universal learners and optimal search [?]. A new kind of AI 
is emerging. Does it really deserve the attribute "new," given that its roots 
date back to the 1930s, when Gödel published the fundamental result of theoretical 
computer science and Zuse started to build the first general purpose 
computer (completed in 1941), and the 1960s, when Solomonoff and Kolmogorov 
published their first relevant results? An affirmative answer seems justified, since 
it is the recent results on practically feasible computable variants of the old incomputable 
methods that are currently reinvigorating the long dormant field. 
The "new" AI is new in the sense that it abandons the mostly heuristic or non-general 
approaches of the past decades, offering methods that are both general 
and theoretically sound, and provably optimal in a sense that does make sense 
in the real world. 

We are led to claim that the future will belong to universal or near-universal 
learners that are more general than traditional reinforcement learners / decision 
makers depending on strong Markovian assumptions, or than learners based 
on traditional statistical learning theory, which often require unrealistic i.i.d. or 
Gaussian assumptions. Due to ongoing hardware advances the time has come for 
optimal search in algorithm space, as opposed to the limited space of reactive 
mappings embodied by traditional methods such as artificial feedforward neural 
networks. 

It seems safe to bet that not only computer scientists but also physicists 
and other inductive scientists will start to pay more attention to the fields of 
universal induction and optimal search, since their basic concepts are irresistibly 
powerful and general and simple. How long will it take for these ideas to unfold 
their full impact? A very naive and speculative guess driven by wishful think- 
ing might be based on identifying the "greatest moments in computing history" 
and extrapolating from there. Which are those "greatest moments"? Obvious 
candidates are: 



1. 1623: first mechanical calculator by Schickard starts the computing age (fol- 
lowed by machines of Pascal, 1640, and Leibniz, 1670). 

2. Roughly two centuries later: concept of a programmable computer (Babbage, 
UK, 1834-1840). 

3. One century later: fundamental theoretical work on universal integer-based 
programming languages and the limits of proof and computation (Gödel, 
Austria, 1931, reformulated by Turing, UK, 1936); first working programmable 
computer (Zuse, Berlin, 1941). 

(The next 50 years saw many theoretical advances as well as faster and faster 
switches — relays were replaced by tubes by single transistors by numerous transis- 
tors etched on chips — but arguably this was rather predictable, incremental progress 
without radical shake-up events.) 

4. Half a century later: World Wide Web (UK's Berners-Lee, Switzerland, 
1990). 

This list seems to suggest that each major breakthrough tends to come roughly 
twice as fast as the previous one. Extrapolating the trend, optimists should 
expect the next radical change to manifest itself one quarter of a century af- 
ter the most recent one, that is, by 2015, which happens to coincide with the 
date when the fastest computers will match brains in terms of raw computing 
power, according to frequent estimates based on Moore's law. The author is 
confident that the coming 2015 upheaval (if any) will involve universal learning 
algorithms and Gödel machine-like, optimal, incremental search in algorithm 
space [56], possibly laying a foundation for the remaining series of faster and 
faster additional revolutions culminating in an "Omega point" expected around 
2040. 
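
The dates in this extrapolation follow from a geometric series: if each gap is half the previous one and the gap after 1990 is 25 years, the gaps sum to 25 + 12.5 + 6.25 + ... = 50 years. The sketch below reproduces that arithmetic; the exact halving ratio is an idealization of the "roughly twice as fast" pattern.

```python
def omega_point(last_event=1990.0, next_gap=25.0, ratio=0.5):
    """Limit of the event dates: last_event + next_gap / (1 - ratio)."""
    return last_event + next_gap / (1.0 - ratio)


def extrapolated_dates(last_event=1990.0, next_gap=25.0, ratio=0.5, n=5):
    """First n extrapolated event dates, each gap `ratio` times the previous one."""
    dates, t, gap = [], last_event, next_gap
    for _ in range(n):
        t += gap
        dates.append(t)
        gap *= ratio
    return dates
```

With the default values the first extrapolated date is 2015 and the limit is 2040, matching the two years named in the text.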